
    Analysing Data-to-Text Generation Benchmarks

    Recently, several datasets associating data with text have been created to train data-to-text surface realisers. It is unclear, however, to what extent the surface realisation task exercised by these datasets is linguistically challenging. Do these datasets provide enough variety to encourage the development of generic, high-quality data-to-text surface realisers? In this paper, we argue that these datasets have important drawbacks. We back up our claim using statistics, metrics and manual evaluation. We conclude by eliciting a set of criteria for the creation of a data-to-text benchmark which could better support the development, evaluation and comparison of linguistically sophisticated data-to-text surface realisers.
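
    As a concrete illustration of the kind of statistics such an analysis relies on, the sketch below computes two simple diversity measures over a toy corpus of (triple set, text) pairs: the distribution of input shapes (multisets of relation names) and the type-token ratio of the reference texts. The corpus format and the two measures are illustrative assumptions, not the paper's actual protocol.

```python
# Toy dataset-diversity statistics for a data-to-text corpus (a sketch,
# assuming a list of (triples, text) pairs rather than any real benchmark).
from collections import Counter

def input_patterns(corpus):
    """Map each input to its 'shape': the multiset of relation names."""
    return Counter(tuple(sorted(rel for _, rel, _ in triples))
                   for triples, _ in corpus)

def type_token_ratio(corpus):
    """Lexical diversity of the reference texts (higher = more varied)."""
    tokens = [tok for _, text in corpus for tok in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

corpus = [
    ([("Alan_Bean", "birthPlace", "Wheeler")], "Alan Bean was born in Wheeler."),
    ([("Alan_Bean", "mission", "Apollo_12")], "Alan Bean flew on Apollo 12."),
]
print(input_patterns(corpus))
print(f"TTR: {type_token_ratio(corpus):.2f}")
```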

    Learning Embeddings to lexicalise RDF Properties

    A difficult task when generating text from knowledge bases (KB) consists in finding appropriate lexicalisations for KB symbols. We present an approach for lexicalising knowledge base relations and apply it to DBpedia data. Our model learns low-dimensional embeddings of words and RDF resources and uses these representations to score RDF properties against candidate lexicalisations. Training our model on (i) pairs of RDF triples and automatically generated verbalisations of these triples and (ii) pairs of paraphrases extracted from various resources yields competitive results on DBpedia data.
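
    A minimal sketch of the scoring step the abstract describes: candidate lexicalisations are ranked against an RDF property by cosine similarity of their embeddings. The hand-set toy vectors stand in for the jointly learned word/resource embeddings; names such as dbo:birthPlace are only illustrative.

```python
# Scoring candidate lexicalisations for an RDF property by embedding
# similarity (a sketch; the paper learns these vectors from verbalised
# triples and paraphrase pairs, whereas these are hand-set toys).
import numpy as np

embeddings = {
    "dbo:birthPlace":  np.array([0.9, 0.1, 0.0]),
    "was born in":     np.array([0.8, 0.2, 0.1]),
    "is the mayor of": np.array([0.1, 0.9, 0.3]),
}

def score(prop, phrase):
    """Cosine similarity between a property vector and a phrase vector."""
    a, b = embeddings[prop], embeddings[phrase]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

candidates = ["was born in", "is the mayor of"]
best = max(candidates, key=lambda c: score("dbo:birthPlace", c))
print(best)  # -> "was born in"
```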

    A Statistical, Grammar-Based Approach to Microplanning

    While there has been much work in recent years on data-driven natural language generation, little attention has been paid to the fine-grained interactions that arise during microplanning between aggregation, surface realisation and sentence segmentation. In this paper, we propose a hybrid symbolic/statistical approach to jointly model these interactions. Our approach integrates a small handwritten grammar, a statistical hypertagger and a surface realisation algorithm. It is applied to the verbalisation of knowledge base queries and tested on 13 knowledge bases to demonstrate domain independence. We evaluate our approach in several ways. A quantitative analysis shows that the hybrid approach outperforms a purely symbolic approach in terms of both speed and coverage. Results from a human study indicate that users find the output of this hybrid statistical/symbolic system more fluent than both a template-based and a purely symbolic grammar-based approach. Finally, we illustrate by means of examples that our approach can account for various factors impacting aggregation, sentence segmentation and surface realisation.
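
    The skeleton below sketches the shape of such a hybrid pipeline: a statistical scorer (the hypertagger's role) prunes the grammar trees considered for each input symbol before the symbolic realiser combines them. The grammar, the toy scorer and the naive left-to-right concatenation are simplifying assumptions; the actual system performs full TAG-based realisation.

```python
# Hybrid statistical/symbolic pipeline skeleton (a sketch; all names,
# the toy grammar and the scorer are illustrative assumptions).

def hypertag(symbol, grammar, model, k=2):
    """Keep only the k grammar trees the statistical model scores highest."""
    return sorted(grammar[symbol], key=lambda t: model(symbol, t),
                  reverse=True)[:k]

def realise(symbols, grammar, model):
    """Pick the top tree per symbol and concatenate (real TAG combination omitted)."""
    return " ".join(hypertag(s, grammar, model, k=1)[0] for s in symbols)

grammar = {"capital": ["the capital of", "whose capital is"],
           "France": ["France"]}
model = lambda sym, tree: -len(tree)  # toy scorer: prefer shorter trees
print(realise(["capital", "France"], grammar, model))  # -> the capital of France
```

    Pruning before realisation is what buys the speed and coverage gains the quantitative analysis reports: the symbolic search space shrinks before any combination is attempted.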

    Creating Training Corpora for NLG Micro-Planning

    In this paper, we focus on how to create data-to-text corpora which can support the learning of wide-coverage microplanners, i.e., generation systems that handle lexicalisation, aggregation, surface realisation, sentence segmentation and referring expression generation. We start by reviewing common practice in designing training benchmarks for Natural Language Generation. We then present a novel framework for semi-automatically creating linguistically challenging NLG corpora from existing Knowledge Bases. We apply our framework to DBpedia data and compare the resulting dataset with (Wen et al., 2016)'s dataset. We show that while (Wen et al., 2016)'s dataset is more than twice as large as ours, it is less diverse both in terms of input and in terms of text. We thus propose our corpus generation framework as a novel method for creating challenging datasets from which NLG models capable of generating text from KB data can be learned.
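
    One ingredient of such a framework is building candidate inputs to verbalise. The sketch below enumerates small connected sets of triples about a single entity, which could then be passed to human annotators; the in-memory triple list and the size bound are illustrative assumptions, not the paper's selection procedure.

```python
# Enumerating candidate data units for verbalisation (a sketch, assuming
# an in-memory triple list rather than a live DBpedia endpoint).
from itertools import combinations

triples = [
    ("Alan_Bean", "birthPlace", "Wheeler"),
    ("Alan_Bean", "occupation", "Test_pilot"),
    ("Alan_Bean", "mission", "Apollo_12"),
]

def data_units(entity, triples, max_size=2):
    """All triple sets about `entity`, up to max_size triples each."""
    about = [t for t in triples if t[0] == entity]
    units = []
    for n in range(1, max_size + 1):
        units.extend(combinations(about, n))
    return units

for unit in data_units("Alan_Bean", triples):
    print(unit)
```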

    Deep Graph Convolutional Encoders for Structured Data to Text Generation

    Most previous work on neural text generation from graph-structured data relies on standard sequence-to-sequence methods. These approaches linearise the input graph to be fed to a recurrent neural network. In this paper, we propose an alternative encoder based on graph convolutional networks that directly exploits the input structure. We report results on two graph-to-sequence datasets that empirically show the benefits of explicitly encoding the input graph structure. (INLG 2018)
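
    A minimal sketch of the graph convolutional layer such an encoder builds on: each node's new representation mixes its neighbours' features through a self-loop-augmented, degree-normalised adjacency matrix. The single layer, shapes and random inputs are illustrative, not the paper's full encoder.

```python
# One graph convolutional layer (a sketch of the building block, not the
# paper's encoder; shapes and inputs are arbitrary toys).
import torch

def gcn_layer(X, A, W):
    """X: [n, d] node features, A: [n, n] adjacency, W: [d, d_out] weights."""
    A_hat = A + torch.eye(A.size(0))          # add self-loops
    deg = A_hat.sum(dim=1)                    # node degrees
    D_inv_sqrt = torch.diag(deg.pow(-0.5))    # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalisation
    return torch.relu(A_norm @ X @ W)

n, d, d_out = 4, 8, 16
X = torch.randn(n, d)
A = torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0],
                  [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
W = torch.randn(d, d_out)
print(gcn_layer(X, A, W).shape)  # torch.Size([4, 16])
```

    Stacking such layers lets information flow along graph edges directly, which is precisely what a linearised sequence-to-sequence encoder discards.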

    Bootstrapping Generators from Noisy Data

    A core step in statistical data-to-text generation concerns learning correspondences between structured data representations (e.g., facts in a database) and associated texts. In this paper we aim to bootstrap generators from large-scale datasets where the data (e.g., DBpedia facts) and related texts (e.g., Wikipedia abstracts) are loosely aligned. We tackle this challenging task by introducing a special-purpose content selection mechanism. We use multi-instance learning to automatically discover correspondences between data and text pairs and show how these can be used to enhance the content signal while training an encoder-decoder architecture. Experimental results demonstrate that models trained with content-specific objectives improve upon a vanilla encoder-decoder which solely relies on soft attention. (NAACL 2018)
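
    The sketch below illustrates the multi-instance intuition in its simplest form: each fact is scored against every sentence of the loosely aligned text, and facts with no well-matching sentence can be down-weighted. The word-overlap scorer is a naive stand-in for the learned alignment model.

```python
# Naive fact-sentence alignment scoring (a sketch of the multi-instance
# idea; the paper learns this alignment rather than using word overlap).
def overlap(fact, sentence):
    """Fraction of the fact's words that appear in the sentence."""
    fact_words = set(" ".join(fact).replace("_", " ").lower().split())
    sent_words = set(sentence.lower().split())
    return len(fact_words & sent_words) / len(fact_words)

facts = [("Alan_Bean", "birth place", "Wheeler"),
         ("Alan_Bean", "alma mater", "UT_Austin")]
sentences = ["alan bean was born in wheeler , texas ."]

for fact in facts:
    best = max(overlap(fact, s) for s in sentences)
    print(fact, f"best sentence score: {best:.2f}")
```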

    Using FB-LTAG Derivation Trees to Generate Transformation-Based Grammar Exercises

    Using a Feature-Based Lexicalised Tree Adjoining Grammar (FB-LTAG), we present an approach for generating pairs of sentences that are related by a syntactic transformation, and we apply this approach to create language learning exercises. We argue that the derivation trees of an FB-LTAG provide a good level of representation for capturing syntactic transformations. We relate our approach to previous work on sentence reformulation, question generation and grammar exercise generation. We evaluate precision and linguistic coverage, and we demonstrate the genericity of the proposal by applying it to a range of transformations including the passive/active transformation, the pronominalisation of an NP, the assertion/yes-no question relation and the assertion/wh-question transformation.
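
    To make the notion of a transformation pair concrete, the toy sketch below maps an active clause to its passive counterpart over a flat (subject, verb, object) record. This is a deliberate simplification: the paper operates on FB-LTAG derivation trees rather than such flat tuples, and the tiny participle lexicon is invented.

```python
# Toy active-to-passive transformation pair (a sketch; the paper's
# transformations operate on derivation trees, not flat tuples).
PAST_PARTICIPLE = {"wrote": "written", "ate": "eaten"}  # invented toy lexicon

def passivise(clause):
    """Map an active (subj, verb, obj) clause to its passive counterpart."""
    subj, verb, obj = clause
    return (obj, "was " + PAST_PARTICIPLE[verb], "by " + subj)

active = ("Mary", "wrote", "the letter")
passive = passivise(active)
print(" ".join(active) + ".")   # Mary wrote the letter.
print(" ".join(passive) + ".")  # the letter was written by Mary.
```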

    Generating Grammar Exercises

    Grammar exercises for language learning fall into two distinct classes: those that are based on "real-life sentences" extracted from existing documents or from the web; and those that seek to facilitate language acquisition by presenting the learner with exercises whose syntax is as simple as possible and whose vocabulary is restricted to that contained in the textbook being used. In this paper, we introduce a framework (called GramEx) which permits generating the second type of grammar exercise. Using generation techniques, we show that a grammar can be used to semi-automatically generate grammar exercises which target a specific learning goal, are made of short, simple sentences, and whose vocabulary is restricted to that used in a given textbook.
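
    A minimal sketch of the vocabulary-restriction idea: candidate sentences are kept only if all their words appear in the textbook word list, then turned into fill-in-the-blank items targeting a learning goal. The word list, sentences and blanking rule are illustrative assumptions, not GramEx's grammar-based generation.

```python
# Vocabulary-filtered fill-in-the-blank exercises (a sketch; the sentence
# list stands in for grammar-generated output, and the vocab is a toy).
TEXTBOOK_VOCAB = {"the", "cat", "dog", "sleeps", "runs"}

def make_exercises(sentences, target_words):
    items = []
    for sent in sentences:
        words = sent.lower().rstrip(".").split()
        if not set(words) <= TEXTBOOK_VOCAB:
            continue  # uses vocabulary outside the textbook: skip
        for w in words:
            if w in target_words:  # blank out the targeted word
                items.append((sent.replace(w, "_____"), w))
    return items

sentences = ["The cat sleeps.", "The professor lectures."]
for blanked, answer in make_exercises(sentences, {"sleeps", "runs"}):
    print(blanked, "->", answer)  # The cat _____. -> sleeps
```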

    Using Regular Tree Grammars to Enhance Sentence Realisation

    Feature-based regular tree grammars (FRTG) can be used to generate the derivation trees of a feature-based tree adjoining grammar (FTAG). We make use of this fact to specify and implement both an FTAG-based sentence realiser and a benchmark generator for this realiser. We argue furthermore that the FRTG encoding enables us to improve on other proposals based on a grammar of TAG derivation trees in several ways: it preserves the compositional semantics that can be encoded in feature-based TAGs; it increases efficiency and restricts overgeneration; and it provides a uniform resource for generation, benchmark construction, and parsing.
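
    The toy regular tree grammar below makes the underlying idea concrete: rules rewrite a nonterminal into a labelled tree whose leaves may themselves be nonterminals, and enumerating all derivations yields a space of derivation trees. The grammar is invented and omits the feature structures that an FRTG adds.

```python
# Enumerating the trees of a toy regular tree grammar (a sketch; real
# FRTGs add feature structures, which this invented grammar omits).
RULES = {
    "S":  [("s", ["NP", "VP"])],
    "NP": [("john", []), ("mary", [])],
    "VP": [("sleeps", []), ("sees", ["NP"])],
}

def derive(symbol):
    """Yield every tree (label, children) derivable from `symbol`."""
    for label, kids in RULES[symbol]:
        if not kids:
            yield (label, [])
        elif len(kids) == 1:
            for t in derive(kids[0]):
                yield (label, [t])
        else:  # exactly two children in this toy grammar
            for left in derive(kids[0]):
                for right in derive(kids[1]):
                    yield (label, [left, right])

print(sum(1 for _ in derive("S")))  # 6 derivation trees
```

    The same enumeration serves two purposes at once, which is the uniformity the abstract points to: read forward it is a benchmark generator, read as a search space it underlies realisation.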

    Building RDF Content for Data-to-Text Generation

    In Natural Language Generation (NLG), one important limitation is the lack of common benchmarks on which to train, evaluate and compare data-to-text generators. In this paper, we take a step in that direction and introduce a method for automatically creating an arbitrarily large repertoire of data units that could serve as input for generation. Using both automated metrics and a human evaluation, we show that the data units produced by our method are both diverse and coherent.
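
    One plausible reading of "coherent data units" is connected sets of triples; the sketch below grows a unit from a seed entity by repeatedly adding a triple that shares an entity with the unit built so far. The toy graph and the random growth strategy are assumptions, not the paper's actual method.

```python
# Growing a connected data unit from a seed entity (a sketch under the
# assumption that coherence means connectedness; the graph is a toy).
import random

TRIPLES = [
    ("Alan_Bean", "mission", "Apollo_12"),
    ("Apollo_12", "operator", "NASA"),
    ("Alan_Bean", "birthPlace", "Wheeler"),
    ("NASA", "headquarters", "Washington_DC"),
]

def grow_unit(seed, triples, size, rng=random.Random(0)):
    """Add triples that touch the unit's entities until `size` is reached."""
    unit, entities = [], {seed}
    pool = list(triples)
    while len(unit) < size:
        linked = [t for t in pool if t[0] in entities or t[2] in entities]
        if not linked:
            break  # nothing left that keeps the unit connected
        t = rng.choice(linked)
        unit.append(t)
        entities.update((t[0], t[2]))
        pool.remove(t)
    return unit

print(grow_unit("Alan_Bean", TRIPLES, size=3))
```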